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1 Introduction 



Linear algebra over small finite fields has many direct applications, such as cryptogra- 
phy and coding theory. Other applications include efficient linear algebra over the ra- 
tionals, e.g., by reducing such computations to a series of computations modulo small 
primes, and solving non-linear systems of equations using Grobner bases ifTSll . For the 
latter recent work has emphasised that the right choice of linear algebra algorithms and 
implementations can make a significant impact on the performance of Grobner basis 
algorithms ifTOl . 

Compared to other small finite fields, those with even characteristic have some special 
properties which make them a prominent choice for designing cryptographic and cod- 
ing systems - cf., the AES |8| as a prime example. For instance, addition is simply 
XOR and is hence natively available on modern CPUs, the same cannot be said for 
other small finite fields. Yet, this family of finite fields has not received much atten- 
tion in the literature on linear algebra. However, that the current state of the art in the 
literature leaves something to be desired can be observed from the following simple 
benchmark: Computing the (reduced) row echelon form of a random 4, 000 x 4, 000 
matrix over F4 on a 2.66 Ghz Intel i7 CPU takes 5s using GAP 4.4.12 fTTl (not re- 
duced) or even 876.6s and 2805.8s using NTL 5.4.2 [17] (not reduced) and Sage ijH) 
(reduced) respectively. Furthermore, LinBox's FFPACK |9| performs the same task 
over the prime field F3 (reduced) in 4.55s. For comparision, the closed source system 
Magma |6 1 can compute the reduced row echelon form of a dense 4, 000 x 4, 000 ma- 
trix over F4 in 0.64s (reduced) and the same operation over F2 takes 0.054s using the 
M4RI library 13). 

In this work, we present the M4RIE library which implements efficient algorithms for 
Unear algebra with dense matrices over F2e for 2 < e < loQ As the name of the 
library indicates, it makes heavy use of the M4RJ library [1] both directly (i.e., by 
calling it) and indirectly (i.e., by using its concepts). The contributions of this work are 
as follows. We provide an open-source GPLv2h- C library for efficient linear algebra 
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over with 2 < e < 10. In this library we implemented an idea due to Bradshaw 
and Boothby f5] which reduces matrix multiplication over Fp^ to a series of matrix 
multiplications over ¥p. Furthermore, we propose a caching technique - Newton-John 
tables - to avoid finite field multiplications which is inspired by Kronrod's method 
("M4RM") |4, 2 1 for matrix multiplication over F2. Using these two techniques we 
provide asymptotically fast triangular solving with matrices (TRSM) and PLE-based 
IIT3I Gaussian elimination. As a result, we are able to significantly improve upon the 
state of the art in dense linear algebra over F2= with 2 < e < 10. For example, the 
above benchmark is completed in 0.4s by our library. 



2 Notation 

We represent elements in F2e = F2[a;]/(/), with / G F2[x], deg(/) — e and / 
irreducible, as polynomials J2iZo ^-j^' coefficient vectors (og-i, . . . , uq) where 
a.i € ¥2- We sometimes identify the coefficient vector {a^-i, ■ ■ ■ , ao) with the integer 
IZiZo Q.i2*, e.g., when indexing tables by finite field elements. By a we denote some 
root of the primitive polynomial / of F2e . By Ai we denote the i-th row of the matrix 
A and by Aij the entry in row i and column j of A. We start counting at zero. We 
represent permutation matrices as LAPACK-style permutation vectors. That is to say, 
that for example the permutation matrix 

"100' 
1 
1 

is stored as P = [0, 2, 2], where for each index i the entry Pi encodes which (row 
or column) swap should to be performed on the input matrix. This allows to apply 
permutations in-place. 



3 Matrix representation 

The M4RIE library features two matrix types, each of which is optimised for certain 
operations. Both representations use one or more M4RI matrices as data storage and 
hence re-use M4Rrs matrix window concept ||2l, allocation routines and data struc- 
tures. 



3.1 packed: mzed_t 

Considering the polynomial representation of elements in F2e , we may bit-pack several 
such elements in one machine word. Since we are using the M4RI library as actual data 
storage, this means that words hold 64 bits |2|. Hence, the mzed_t data type packs 
elements of F2E in 64-bit words. However, instead of packing as many elements as 
possible into one word, every element is padded to the next length dividing 64. Thus, 
for example, elements in F32 are represented as polynomials of degree 8 where the 
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top three coefficients are always zero. While this wastes some storage space and CPU 
time, it allows for more compact code by reducing what cases have to be considered. 
The second row of Figure[T]gives an example. 

In this representation additions are very cheap since we can disregard any element 
boundaries and simply call M4RI's addition routines. Scalar multiplication, on the 
other hand, is much more expensive. Either we perform a table look-up for each ele- 
ment or we perform bit operations on words which perform multiplication and modular 
reduction in parallel on all elements of a word. In either case, multiplication is consid- 
erably more expensive than addition. 



3.2 sliced: mzd_slice_t 



Instead of representing matrices over ¥2" as matrices over polynomials we may repre- 
sent them as polynomials with matrix coefficients. That is, for each degree we store 
matrices over F2 which hold the coefficients for this degree. Hence, the data type 
mzd_slice_t for matrices over ¥2' internally stores e-tuples of M4RI matrices, i.e., 
matrices over F2. We call each M4RI matrix for some degree i a slice and refer to the 
operation converting from mzed_t to mzd_slice_t as slicing. The inverse operation 
is called clinging. The third row of Figure [T] gives an example of the mzd_slice_t 
representation. 

Addition is performed by adding each slice independently and hence is quite efficient. 
Scalar multiplication, on the other hand, has to rely on similar techniques as in mzed_t . 
However, in addition, data locality in mzd_slice_t is worse than in mzed_t. Thus, 
here too, scalar multiplication is much more expensive than addition. 

Thus, in this work, we present algorithms for matrix multiplication and elimination 
where we avoid many scalar multiplications. 



1 
1 1 



a'' + 1 a 
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1 

1 
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Figure 1:2x2 matrix over Fg 
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4 Newton- John tables 



To explain the main idea behind Newton- John tables, consider matrix multiplication as 
given in Algorithm[T] 



Algorithm 1: Cubic matrix multiplication 
Input: A-m X £ matrix 
Input: B - £ X n matrix 
Output: C = A- B 

1 begin 

2 for < i < m do 

3 for < j < ^ do 

4 \_Cj ^ Cj + Aj^, X Bi\ 

5 return C; 



This algorithm uses m-£-n finite field multiplications and the same number of additions. 
That is, in line|4]the row Bj is scaled by Aj^i and then added to the row Cj . Observe that 
Bj is rescaled ^-times, while there are 2*^ different values for Aj^i and hence multiples 
of Bj. Indeed, if 2"^ < £ it is advantageous to precompute all possible 2'^ multiples 
of Bj and to store these multiples in a table indexed by finite field elements. These 
precompuation tables are quite similar to Kronrod's method for matrix multiplication, 
also sometimes referred to as "greasing". Hence, we call these tables Newton-John 
tables to honour Olivia Newton- John's work in 1 14|. 

We also note that we create these tables in less than 2'^ multiplications. That is, we first 
compute a* • Bj for all < * < e. Then, we compute each multiple of Bj as a linear 
combination of (a" ■ Bj, . . . q!^~^ ■ Bj) which we just computed. Using Gray codes for 
the addition step we can thus construct all 2"^ multiples using 2"^ additions [,12J . The 
subroutine creating these tables is given in Algorithmic 

The complete algorithm is given in Algorithm|3]which costs m ■ {2'^ + £) ■ n additions 
and m ■ e ■ n multiplications. Note that e is a constant here and asymptotically we thus 
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achieve 0(n^) additions and 0(n^) multiplications. 



Algorithm 2: MakeTable 
Input: B - an 1 x n matrix 

Output: T - a 2"^ X n matrix with each row a multiple of B 

1 begin 

2 AI <^ e X n matrix; 

3 T ^ 2*^ X n matrix; 

4 for < fc < e do 

5 \_ Mk ^ a'' ■ B; 

6 T f- all linear combinations of rows of M; 

7 return T; 



Algorithm 3: Newton- John multiplication 
Input: A-m X £ matrix 
Input: B - 1 X n matrix 
Output: C = A- B 

1 begin 

2 for < i < m do 

3 T ^ MAKETABLE(Bi); 

4 for < j < ^ do 

5 X -(r- Aj,i as an integer; 

6 Cj <— Cj + Tx', 

7 return C; 



Many variants of this basic algorithm are possible. For instance, we may use more 
than one Newton-John table or process the data in blocks for better cache friendliness 
(cf., lIJl for both techniques). Furthermore, if 2"^ is too big to precompute T we may 
precompute only M (cf.. Algorithmic and perform e additions in line|6] Our library 
uses eight Newton- John tables and processes matrices in blocks that fit into L2 cache. 
Since e is small we always compute the full table T. 

Of course, this algorithm is not asymptotically fast. Hence, we only use it as a base case 
for the Strassen-Winograd algorithm |20| for matrix multiplication which has complex- 
ity 0{n}°^'^ ^) . In our implementation we cross over to the base case roughly when the 
submatrices fit into L2 cache; however, the exact value depends on the size of the finite 
field. Table [U lists CPU times multiplying two 1,000 x 1,000 matrices in our im- 
plementation of Strassen-Winograd on top of Newton- John multiplication (abbreviated 
S-W/N-J in the following), in Magma and GAP (cf., AppendixlAlfor a brief discussion 
of Magma's and GAP's implementations). Note that the hex string in the header of the 
last column indicates which revision of the public source code repository!! was used to 
produce these times. 

■^cf., [https : //bitbucket . org/m4riel 
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e 


Magma 


GAP 


S-W/N-J 




2.15-10 


4.4.12 


6b24b839a46f 


2 


0.013s 


0.216s 


0.012s 


3 


0.036s 


0.592s 


0.020s 


4 


0.074s 


0.588s 


0.022s 


5 


1.276s 


1.568s 


0.048s 


6 


1.286s 


1.356s 


0.059s 


7 


1.316s 


1.276s 


0.082s 


8 


1.842s 


1.328s 


0.160s 


9 


3.985s 


64.700s 


0.626s 


10 


4.160s 


59.131s 


1.080s 



Table 1: Multiplication of 1, 000 x 1, 000 matrices on 2.66 Ghz Intel i7 



4.1 Gaussian elimination 



Newton- John tables can also be used in Gaussian elimination, as shown in Algorithm|4] 
This algorithm uses r ■ {n + 2'^) ■ n additions and r • (e + 1) • n multiplications, which 
gives an asymptotic complexity of ©(rt'^) additions and ©(n^) multiplications. Again, 
a variety of variants are possible such as multiple Newton- John tables (similar to ||2J). 
Our implementation uses six Newton-John tables. 



Algorithm 4: Newton- John Gauss elimination 



Input: A-m X n matrix 

Output: r - the rank of A 

Result: A is in reduced row echelon form 

begin 

r ^ 0; 

for < j < n do 
for s < i < m do 
if J ^ then 

A, ^ at] ■ A,- 
swap the rows i and r in A; 

T ^ MAKETABLE(Ar); 

for < fc < m do 

if fc = s then continue; 

X ^ Ak,j as an integer; 

Ak^Ak + T^; 



r r - 
break; 



1; 



return r; 



In Table|2]we give CPU times for computing the reduced row echelon form of random 
1,000 X 1,000 matrices over in Magma, GAP and our implementation. Note 
that gap's SemiEchelonMat command does not compute the reduced row echelon 
form. Hence, to normalise the data we multiplied all GAP timings in Table[2]by two. 
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e 


Magma 


GAP 


Newton-John 




2.15-10 


4.4.12 


6b24b839a46f 


2 


0.028s 


0.184s 


0.012s 


3 


0.045s 


0.496s 


0.019s 


4 


0.054s 


0.560s 


0.022s 


5 


0.690s 


1.224s 


0.042s 


6 


0.670s 


1.168s 


0.048s 


7 


0.700s 


1.104s 


0.060s 


8 


0.866s 


1.136s 


0.081s 


9 


1.523s 


35.634s 


0.427s 


10 


1.540s 


36.154s 


0.831s 



Table 2: Elimination of 1, 000 x 1, 000 matrices on 2.66 Ghz Intel i7 

4.2 PLE decomposition 

Algorithmic an be modified to compute the PLE decomposition instead of the row 
echelon form. Since this definition is lesser well-known we reproduce it below. For a 
more detailed treatment of PLE decomposition see [ 13 1. 

Definition 1 (PLE). Let A be a m x n matrix over a field K. A PLE decomposition 
of A is a triple of matrices P, L and E such that P is a m x m permutation matrix, L 
is a unit lower triangular matrix, and E is a m x n matrix in row-echelon form, and 
A = PLE. 

Lemma 1 ( 1131 ). A PLE decomposition of any mx n matrix A can be stored in-place 
in A. 

For the sake of simplicity, we compute a minor variant of PLE in our library. That is, 
L is not necessarily unit lower triangular, i.e., we do rescale the pivot row get leading 
entry 1. Then, the changes necessary for Algorithm]?] to compute this variant of PLE 
decomposition (in-place) are: 

• Store i and j in two vectors P and Q in line|6l 

• only start addition in column c + 1 in line[T2]in order to keep L in place; 

• perform column swaps below and on the main diagonal right before line [TSl to 
compress L. 

An alternative perspective on Newton-John-table based PLE decomposition is to con- 
sider it as a block iterative PLE decomposition (cf., f3l) with Newton- John table based 
multiplication updates to the right hand side. 

Note, this algorithm is not asymptotically fast, hence its main application is as a base 
case for asymptotically fast PLE decomposition |13| which reduces PLE decomposi- 
tion to matrix multiplication. For this, one last building block is needed: 
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4.3 TRiangular Solving with Matrices 



Triangular system solving with matrices can also be achieved using Newton- John ta- 
bles. As a example, we given an algorithm for solving X = U^^ ■ B with U upper 
triangular in Algorithm|5] We note that Algorithm|5]is essentially block iterative TRSM 
with Newton- John table based multiplication. Yet, we present it here for completeness. 



Algorithm 5: Newton- John TRSM upper left 
Input: U - m X m upper triangular matrix 
Input: B -m X n matrix 
Result: X = U^^ ■ B is stored in B 

1 begin 

2 for m > z > do 

3 B^ ^ ■ B,; 

4 T ^ MakeTable(Bj); 

5 for < < i do 

7 Bj ^ Bj + T,; 



5 Karatsuba multiplication 

Recall that mzd_slice_t represents matrices over as polynomials with matri- 
ces over F2 as coefficients. Using this representation, matrix multiplication then can 
be accomplished by performing polynomial multiplication and subsequent modular 
reduction. For example, assume we want to compute C ~ A ■ B where A and B 
are over F22. We rewrite A as Aix + Aq and B as Bix + Bq, the product is then 
C = AiBix'^ + (AiBo + AqBi)x + AqBq which reduces to C = [AiBi + AiBq + 
AqBi)x + AqBq + AiBi modulo the primitive polynomial / = + x + 1 of F22. 
Hence, matrix multiplication over F2.! can be reduced to matrix multiplication and ad- 
dition over F2. Using naive polynomial arithmetic we get that matrix multiplication 
over F2= costs e? matrix multiplications over F2. However, using Karatsuba polyno- 
mial multiplication we can reduce this to e'°S2 3 ^ e^-^*''. To get back to the above 
example, we can rewrite it as C = {{Ai + Ao){Bi + B^) + AoBo)x + A^Bq + AiBi 
and hence multiplication costs 3 instead of 4 multiplications over F2. This was first 
explicitly proposed for matrices over Fpn by Bradshaw and Boothby in [5 1. However, 
this technique has been used for linear algebra over with 2 < fc < 4 in Magma for 
some time jlSl . 

Concrete costs are given in Table[3]where the first column lists the CPU time for mul- 
tiplying two 4, 000 X 4, 000 matrices using Strassen-Winograd on top of Newton- John 
multiplication, the column "A/" indicates how many 4, 000 x 4, 000 matrix multipli- 
cations over F2 can be achieved in the same time using the M4RI library, the column 
"naive" lists how many multiplications would be needed by naive polynomial multipli- 
cation, the column " lfT6l " lists the best known complexity for Karatsuba-like formulas, 
the last column shows the number of multiplications which our Karatsuba-like imple- 
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mentation actually achieves. The absolute time of our bitsliced implementation is given 
in the column "Bitslice". Table [3]also compares our implementation with the previous 
two best implementations GAP and Magma . 

However, we note that Karatsuba based multiplication needs more memory than Strassen 
on top of Newton- John multiplication. Our implementation uses three temporary ma- 
trices over F2. We finish this section by pointing out in principle more efficient poly- 
nomial multiplication algorithms for ¥2[x] can be applied (cf., [7j). However, due to 
the small degrees considered in this work it does not seem advantageous. 



e 


Magma 
2.15-10 


GAP 
4.4.12 


S-W/N-J 


Bitslice 


M 


naive 


(16) 


Bitslice/ 
M4RI 


2 


1.220s 


12.501s 


0.630s 


0.224s 


8.8 


4 


3 


3.1 


3 


2.020s 


35.986s 


1.480s 


0.448s 


20.8 


9 


6 


6.3 


4 


5.630s 


39.330s 


1.644s 


0.693s 


23.1 


16 


9 


9.7 


5 


94.740s 


86.517s 


3.766s 


1.005s 


53.0 


25 


13 


14.2 


6 


89.800s 


85.525s 


4.339s 


1.336s 


61.1 


36 


17 


18.8 


7 


82.770s 


83.597s 


6.627s 


1.639s 


93.3 


49 


22 


23.1 


8 


104.680s 


83.802s 


10.170s 


2.140s 


143.2 


64 


27 


30.1 



Table 3: Multiplication of 4, 000 x 4, 000 matrices over on 2.66 Ghz Intel i7. 



6 Echelon Forms 

Putting these building blocks together 

(1) Karatsuba multiplication, 

(2) Newton-John-based PLE decomposition, 

(3) asymptotically-fast PLE decomposition, 

(4) Newton-John-based Triangular Solving with Matrices (TRSM) and 

(5) asymptotically-fast TRSM, 

we can construct asymptotically fast Gaussian elimination, e.g., computation of (re- 
duced) row echelon forms (cf., [13]). Our implementation uses mzd_slice_t as 
representation for large matrices and switches over to mzed_t when the submatrix 
currently considered fits into L2 cache. Table |4] lists CPU times for computing the 
(reduced) row echelon form using Magma (reduced), GAP (not reduced) and our im- 
plementation (reduced). Note that our implementation as of now only implements 
asymptotically fast PLE decomposition up to e = 8, for e e {9, 10} Newton- John 
table based Gaussian elimination is used. 
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e 


Magma 


GAP 


M4RIE 




2.15-10 


4.4.12 


6b24b839a46f 


2 


6.040s 


162.658s 


3.310s 


3 


14.470s 


442.522s 


5.332s 


4 


60.370s 


502.672s 


6.330s 


5 


659.030s 


N/A 


10.511s 


6 


685.460s 


N/A 


13.078s 


7 


671.880s 


N/A 


17.285s 


8 


840.220s 


N/A 


20.247s 


9 


1630.380s 


N/A 


260.774s 


10 


1631.350s 


N/A 


291.298s 



Table 4: Elimination of 10, 000 x 10, 000 matrices on 2.66 Ghz Intel i? 
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A Other implementations 

GAP jfTll packs finite field elements of size 2 < s < 2^ into words using 8 bits 
per entry. Arithmetic is implemented using table look ups. Multiplication is 
performed using cubic matrix multiplication. Row echelon forms are computed 
using cubic Gaussian elimination. 

LinBox/FFPACK uses floating point numbers to represent finite field elements. 
For extension fields, elements are represented as "sparse" integers, such that 
there are enough zeroes between two coefficients to avoid the carry travelling 
too far. However, this feature is not readily exposed to the end-user and requires 
some tweaking to work. FFPACK implements Strassen- Winograd multiplication 
and asymptotically fast PLUQ decomposition for Gaussian elimination. 

Magma [6] implements asymptotically fast matrix multiplication and reduces Gaus- 
sian elimination to LQUP decomposition. For F2fc with 2 < A: < 4 a bit- 
sliced representation similar to our mzd_slice_t is used in combination with 
Karatsuba-like formulas for polynomial multiplication. For 5 < /c < 20 el- 
ements in are represented using Zech logarithms. For larger k a packed 
polynomial representation is used similar to our mzed_t IITSl . 
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